{
 "cells": [
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "# 机器学习算法分类\n",
    "- 监督学习\n",
    "    - 分类：k-近邻算法、贝叶斯分类、决策树与随机森林、逻辑回归、神经网络\n",
    "    - 回归：线性回归、岭回归、神经网络\n",
    "    - 标注：隐马尔可夫模型     (不做要求)\n",
    "- 无监督学习（没有标签值）\n",
    "    - 聚类：k-means（有几十种聚类）\n"
   ]
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## 监督学习\n",
    "（英语：Supervised learning），可以由输入数据中学到或建立一个模型，并依此模式推测新的结果。输入数据是由输入特征值和目标值所组成。函数的输出可以是一个连续的值（称为回归），或是输出是有限个离散值（称作分类）。\n",
    "## 无监督学习\n",
    "（英语：Supervised learning），可以由输入数据中学到或建立一个模型，并依此模式推测新的结果。输入数据是由输入特征值所组成。\n"
   ]
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-01-22T03:52:35.719664Z",
     "start_time": "2025-01-22T03:52:34.356040Z"
    }
   },
   "cell_type": "code",
   "source": [
    "import time\n",
    "\n",
    "from sklearn.datasets import load_iris, fetch_20newsgroups, fetch_california_housing\n",
    "from sklearn.model_selection import train_test_split, GridSearchCV\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "from sklearn.metrics import classification_report\n",
    "from sklearn.feature_extraction import DictVectorizer\n",
    "from sklearn.tree import DecisionTreeClassifier, export_graphviz\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.metrics import roc_auc_score"
   ],
   "outputs": [],
   "execution_count": 2
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "# 分类问题\n",
    "- 分类是监督学习的一个核心问题，在监督学习中，当输出变量取有限个离散值时，预测问题变成为分类问题。最基础的便是二分类问题，即判断是非，从两个类别中选择一个作为预测结果；\n",
    "- 概念：回归是监督学习的另一个重要问题。回归用于预测输入变量和输出变量之间的关系，输出是连续型的值。\n"
   ]
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "load直接加载在内存的，数据集比较小，并不会保存到本地磁盘\n",
    "\n",
    "fetch数据集比较大，下载下来后会存在本地磁盘，下一次就不会再连接sklearn的服务器\n"
   ]
  },
  {
   "cell_type": "code",
   "source": [
    "#鸢尾花数据集，查看特征，目标，样本量\n",
    "li = load_iris()\n",
    "\n",
    "print(\"获取特征值\")\n",
    "print(type(li.data))\n",
    "print(li.data.shape) # 150个样本，4个特征,一般看shape\n",
    "li.data"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    },
    "ExecuteTime": {
     "end_time": "2025-01-22T03:52:35.732206Z",
     "start_time": "2025-01-22T03:52:35.720694Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "获取特征值\n",
      "<class 'numpy.ndarray'>\n",
      "(150, 4)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array([[5.1, 3.5, 1.4, 0.2],\n",
       "       [4.9, 3. , 1.4, 0.2],\n",
       "       [4.7, 3.2, 1.3, 0.2],\n",
       "       [4.6, 3.1, 1.5, 0.2],\n",
       "       [5. , 3.6, 1.4, 0.2],\n",
       "       [5.4, 3.9, 1.7, 0.4],\n",
       "       [4.6, 3.4, 1.4, 0.3],\n",
       "       [5. , 3.4, 1.5, 0.2],\n",
       "       [4.4, 2.9, 1.4, 0.2],\n",
       "       [4.9, 3.1, 1.5, 0.1],\n",
       "       [5.4, 3.7, 1.5, 0.2],\n",
       "       [4.8, 3.4, 1.6, 0.2],\n",
       "       [4.8, 3. , 1.4, 0.1],\n",
       "       [4.3, 3. , 1.1, 0.1],\n",
       "       [5.8, 4. , 1.2, 0.2],\n",
       "       [5.7, 4.4, 1.5, 0.4],\n",
       "       [5.4, 3.9, 1.3, 0.4],\n",
       "       [5.1, 3.5, 1.4, 0.3],\n",
       "       [5.7, 3.8, 1.7, 0.3],\n",
       "       [5.1, 3.8, 1.5, 0.3],\n",
       "       [5.4, 3.4, 1.7, 0.2],\n",
       "       [5.1, 3.7, 1.5, 0.4],\n",
       "       [4.6, 3.6, 1. , 0.2],\n",
       "       [5.1, 3.3, 1.7, 0.5],\n",
       "       [4.8, 3.4, 1.9, 0.2],\n",
       "       [5. , 3. , 1.6, 0.2],\n",
       "       [5. , 3.4, 1.6, 0.4],\n",
       "       [5.2, 3.5, 1.5, 0.2],\n",
       "       [5.2, 3.4, 1.4, 0.2],\n",
       "       [4.7, 3.2, 1.6, 0.2],\n",
       "       [4.8, 3.1, 1.6, 0.2],\n",
       "       [5.4, 3.4, 1.5, 0.4],\n",
       "       [5.2, 4.1, 1.5, 0.1],\n",
       "       [5.5, 4.2, 1.4, 0.2],\n",
       "       [4.9, 3.1, 1.5, 0.2],\n",
       "       [5. , 3.2, 1.2, 0.2],\n",
       "       [5.5, 3.5, 1.3, 0.2],\n",
       "       [4.9, 3.6, 1.4, 0.1],\n",
       "       [4.4, 3. , 1.3, 0.2],\n",
       "       [5.1, 3.4, 1.5, 0.2],\n",
       "       [5. , 3.5, 1.3, 0.3],\n",
       "       [4.5, 2.3, 1.3, 0.3],\n",
       "       [4.4, 3.2, 1.3, 0.2],\n",
       "       [5. , 3.5, 1.6, 0.6],\n",
       "       [5.1, 3.8, 1.9, 0.4],\n",
       "       [4.8, 3. , 1.4, 0.3],\n",
       "       [5.1, 3.8, 1.6, 0.2],\n",
       "       [4.6, 3.2, 1.4, 0.2],\n",
       "       [5.3, 3.7, 1.5, 0.2],\n",
       "       [5. , 3.3, 1.4, 0.2],\n",
       "       [7. , 3.2, 4.7, 1.4],\n",
       "       [6.4, 3.2, 4.5, 1.5],\n",
       "       [6.9, 3.1, 4.9, 1.5],\n",
       "       [5.5, 2.3, 4. , 1.3],\n",
       "       [6.5, 2.8, 4.6, 1.5],\n",
       "       [5.7, 2.8, 4.5, 1.3],\n",
       "       [6.3, 3.3, 4.7, 1.6],\n",
       "       [4.9, 2.4, 3.3, 1. ],\n",
       "       [6.6, 2.9, 4.6, 1.3],\n",
       "       [5.2, 2.7, 3.9, 1.4],\n",
       "       [5. , 2. , 3.5, 1. ],\n",
       "       [5.9, 3. , 4.2, 1.5],\n",
       "       [6. , 2.2, 4. , 1. ],\n",
       "       [6.1, 2.9, 4.7, 1.4],\n",
       "       [5.6, 2.9, 3.6, 1.3],\n",
       "       [6.7, 3.1, 4.4, 1.4],\n",
       "       [5.6, 3. , 4.5, 1.5],\n",
       "       [5.8, 2.7, 4.1, 1. ],\n",
       "       [6.2, 2.2, 4.5, 1.5],\n",
       "       [5.6, 2.5, 3.9, 1.1],\n",
       "       [5.9, 3.2, 4.8, 1.8],\n",
       "       [6.1, 2.8, 4. , 1.3],\n",
       "       [6.3, 2.5, 4.9, 1.5],\n",
       "       [6.1, 2.8, 4.7, 1.2],\n",
       "       [6.4, 2.9, 4.3, 1.3],\n",
       "       [6.6, 3. , 4.4, 1.4],\n",
       "       [6.8, 2.8, 4.8, 1.4],\n",
       "       [6.7, 3. , 5. , 1.7],\n",
       "       [6. , 2.9, 4.5, 1.5],\n",
       "       [5.7, 2.6, 3.5, 1. ],\n",
       "       [5.5, 2.4, 3.8, 1.1],\n",
       "       [5.5, 2.4, 3.7, 1. ],\n",
       "       [5.8, 2.7, 3.9, 1.2],\n",
       "       [6. , 2.7, 5.1, 1.6],\n",
       "       [5.4, 3. , 4.5, 1.5],\n",
       "       [6. , 3.4, 4.5, 1.6],\n",
       "       [6.7, 3.1, 4.7, 1.5],\n",
       "       [6.3, 2.3, 4.4, 1.3],\n",
       "       [5.6, 3. , 4.1, 1.3],\n",
       "       [5.5, 2.5, 4. , 1.3],\n",
       "       [5.5, 2.6, 4.4, 1.2],\n",
       "       [6.1, 3. , 4.6, 1.4],\n",
       "       [5.8, 2.6, 4. , 1.2],\n",
       "       [5. , 2.3, 3.3, 1. ],\n",
       "       [5.6, 2.7, 4.2, 1.3],\n",
       "       [5.7, 3. , 4.2, 1.2],\n",
       "       [5.7, 2.9, 4.2, 1.3],\n",
       "       [6.2, 2.9, 4.3, 1.3],\n",
       "       [5.1, 2.5, 3. , 1.1],\n",
       "       [5.7, 2.8, 4.1, 1.3],\n",
       "       [6.3, 3.3, 6. , 2.5],\n",
       "       [5.8, 2.7, 5.1, 1.9],\n",
       "       [7.1, 3. , 5.9, 2.1],\n",
       "       [6.3, 2.9, 5.6, 1.8],\n",
       "       [6.5, 3. , 5.8, 2.2],\n",
       "       [7.6, 3. , 6.6, 2.1],\n",
       "       [4.9, 2.5, 4.5, 1.7],\n",
       "       [7.3, 2.9, 6.3, 1.8],\n",
       "       [6.7, 2.5, 5.8, 1.8],\n",
       "       [7.2, 3.6, 6.1, 2.5],\n",
       "       [6.5, 3.2, 5.1, 2. ],\n",
       "       [6.4, 2.7, 5.3, 1.9],\n",
       "       [6.8, 3. , 5.5, 2.1],\n",
       "       [5.7, 2.5, 5. , 2. ],\n",
       "       [5.8, 2.8, 5.1, 2.4],\n",
       "       [6.4, 3.2, 5.3, 2.3],\n",
       "       [6.5, 3. , 5.5, 1.8],\n",
       "       [7.7, 3.8, 6.7, 2.2],\n",
       "       [7.7, 2.6, 6.9, 2.3],\n",
       "       [6. , 2.2, 5. , 1.5],\n",
       "       [6.9, 3.2, 5.7, 2.3],\n",
       "       [5.6, 2.8, 4.9, 2. ],\n",
       "       [7.7, 2.8, 6.7, 2. ],\n",
       "       [6.3, 2.7, 4.9, 1.8],\n",
       "       [6.7, 3.3, 5.7, 2.1],\n",
       "       [7.2, 3.2, 6. , 1.8],\n",
       "       [6.2, 2.8, 4.8, 1.8],\n",
       "       [6.1, 3. , 4.9, 1.8],\n",
       "       [6.4, 2.8, 5.6, 2.1],\n",
       "       [7.2, 3. , 5.8, 1.6],\n",
       "       [7.4, 2.8, 6.1, 1.9],\n",
       "       [7.9, 3.8, 6.4, 2. ],\n",
       "       [6.4, 2.8, 5.6, 2.2],\n",
       "       [6.3, 2.8, 5.1, 1.5],\n",
       "       [6.1, 2.6, 5.6, 1.4],\n",
       "       [7.7, 3. , 6.1, 2.3],\n",
       "       [6.3, 3.4, 5.6, 2.4],\n",
       "       [6.4, 3.1, 5.5, 1.8],\n",
       "       [6. , 3. , 4.8, 1.8],\n",
       "       [6.9, 3.1, 5.4, 2.1],\n",
       "       [6.7, 3.1, 5.6, 2.4],\n",
       "       [6.9, 3.1, 5.1, 2.3],\n",
       "       [5.8, 2.7, 5.1, 1.9],\n",
       "       [6.8, 3.2, 5.9, 2.3],\n",
       "       [6.7, 3.3, 5.7, 2.5],\n",
       "       [6.7, 3. , 5.2, 2.3],\n",
       "       [6.3, 2.5, 5. , 1.9],\n",
       "       [6.5, 3. , 5.2, 2. ],\n",
       "       [6.2, 3.4, 5.4, 2.3],\n",
       "       [5.9, 3. , 5.1, 1.8]])"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 3
  },
  {
   "cell_type": "code",
   "source": [
    "print(\"目标值\")\n",
    "print(li.target)\n",
    "print('-' * 50)\n",
    "print(li.DESCR)\n",
    "print('-' * 50)\n",
    "print(li.feature_names)  # 重点,特征名字\n",
    "print('-' * 50)\n",
    "print(li.target_names) # 目标名字"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2025-01-22T03:52:35.736377Z",
     "start_time": "2025-01-22T03:52:35.732206Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "目标值\n",
      "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
      " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
      " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n",
      " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n",
      " 2 2]\n",
      "--------------------------------------------------\n",
      ".. _iris_dataset:\n",
      "\n",
      "Iris plants dataset\n",
      "--------------------\n",
      "\n",
      "**Data Set Characteristics:**\n",
      "\n",
      ":Number of Instances: 150 (50 in each of three classes)\n",
      ":Number of Attributes: 4 numeric, predictive attributes and the class\n",
      ":Attribute Information:\n",
      "    - sepal length in cm\n",
      "    - sepal width in cm\n",
      "    - petal length in cm\n",
      "    - petal width in cm\n",
      "    - class:\n",
      "            - Iris-Setosa\n",
      "            - Iris-Versicolour\n",
      "            - Iris-Virginica\n",
      "\n",
      ":Summary Statistics:\n",
      "\n",
      "============== ==== ==== ======= ===== ====================\n",
      "                Min  Max   Mean    SD   Class Correlation\n",
      "============== ==== ==== ======= ===== ====================\n",
      "sepal length:   4.3  7.9   5.84   0.83    0.7826\n",
      "sepal width:    2.0  4.4   3.05   0.43   -0.4194\n",
      "petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)\n",
      "petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)\n",
      "============== ==== ==== ======= ===== ====================\n",
      "\n",
      ":Missing Attribute Values: None\n",
      ":Class Distribution: 33.3% for each of 3 classes.\n",
      ":Creator: R.A. Fisher\n",
      ":Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n",
      ":Date: July, 1988\n",
      "\n",
      "The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken\n",
      "from Fisher's paper. Note that it's the same as in R, but not as in the UCI\n",
      "Machine Learning Repository, which has two wrong data points.\n",
      "\n",
      "This is perhaps the best known database to be found in the\n",
      "pattern recognition literature.  Fisher's paper is a classic in the field and\n",
      "is referenced frequently to this day.  (See Duda & Hart, for example.)  The\n",
      "data set contains 3 classes of 50 instances each, where each class refers to a\n",
      "type of iris plant.  One class is linearly separable from the other 2; the\n",
      "latter are NOT linearly separable from each other.\n",
      "\n",
      ".. dropdown:: References\n",
      "\n",
      "  - Fisher, R.A. \"The use of multiple measurements in taxonomic problems\"\n",
      "    Annual Eugenics, 7, Part II, 179-188 (1936); also in \"Contributions to\n",
      "    Mathematical Statistics\" (John Wiley, NY, 1950).\n",
      "  - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.\n",
      "    (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.\n",
      "  - Dasarathy, B.V. (1980) \"Nosing Around the Neighborhood: A New System\n",
      "    Structure and Classification Rule for Recognition in Partially Exposed\n",
      "    Environments\".  IEEE Transactions on Pattern Analysis and Machine\n",
      "    Intelligence, Vol. PAMI-2, No. 1, 67-71.\n",
      "  - Gates, G.W. (1972) \"The Reduced Nearest Neighbor Rule\".  IEEE Transactions\n",
      "    on Information Theory, May 1972, 431-433.\n",
      "  - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al\"s AUTOCLASS II\n",
      "    conceptual clustering system finds 3 classes in the data.\n",
      "  - Many, many more ...\n",
      "\n",
      "--------------------------------------------------\n",
      "['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']\n",
      "--------------------------------------------------\n",
      "['setosa' 'versicolor' 'virginica']\n"
     ]
    }
   ],
   "execution_count": 4
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-01-22T03:52:35.742784Z",
     "start_time": "2025-01-22T03:52:35.737403Z"
    }
   },
   "cell_type": "code",
   "source": [
    "print(li.data.shape)\n",
    "li.target.shape"
   ],
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(150, 4)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(150,)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 5
  },
  {
   "cell_type": "code",
   "source": [
    "# 注意返回值, 训练集 train  x_train, y_train        测试集  test   x_test, y_test，顺序千万别搞错了\n",
    "# 默认是乱序的,random_state为了确保两次的随机策略一致，就会得到相同的随机数据，往往会带上\n",
    "x_train, x_test, y_train, y_test = train_test_split(li.data, li.target, test_size=0.25, random_state=1) # 0.25表示测试集占比\n",
    "\n",
    "# print(\"训练集特征值和目标值：\", x_train, y_train)\n",
    "print(\"训练集特征值shape\", x_train.shape)\n",
    "print('-'*50)\n",
    "# print(\"测试集特征值和目标值：\", x_test, y_test)\n",
    "print(\"测试集特征值shape\", x_test.shape)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    },
    "ExecuteTime": {
     "end_time": "2025-01-22T03:52:35.749154Z",
     "start_time": "2025-01-22T03:52:35.742784Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "训练集特征值shape (112, 4)\n",
      "--------------------------------------------------\n",
      "测试集特征值shape (38, 4)\n"
     ]
    }
   ],
   "execution_count": 6
  },
  {
   "cell_type": "code",
   "source": [
    "150*0.25"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    },
    "ExecuteTime": {
     "end_time": "2025-01-22T03:52:35.754124Z",
     "start_time": "2025-01-22T03:52:35.749154Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "37.5"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 7
  },
  {
   "cell_type": "code",
   "source": [
    "# 下面是比较大的数据，需要下载一会，20类新闻\n",
    "# subset代表下载的数据集类型，默认是train，只有训练集\n",
    "news = fetch_20newsgroups(subset='all', data_home='data')\n",
    "# print(news.feature_names)  #这个数据集是没有的，因为没有特征，只有文本数据\n",
    "# print(news.DESCR)\n",
    "print('第一个样本')\n",
    "print(news.data[0])"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    },
    "ExecuteTime": {
     "end_time": "2025-01-22T03:53:11.789581Z",
     "start_time": "2025-01-22T03:52:35.754124Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "第一个样本\n",
      "From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>\n",
      "Subject: Pens fans reactions\n",
      "Organization: Post Office, Carnegie Mellon, Pittsburgh, PA\n",
      "Lines: 12\n",
      "NNTP-Posting-Host: po4.andrew.cmu.edu\n",
      "\n",
      "\n",
      "\n",
      "I am sure some bashers of Pens fans are pretty confused about the lack\n",
      "of any kind of posts about the recent Pens massacre of the Devils. Actually,\n",
      "I am  bit puzzled too and a bit relieved. However, I am going to put an end\n",
      "to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\n",
      "are killing those Devils worse than I thought. Jagr just showed you why\n",
      "he is much better than his regular season stats. He is also a lot\n",
      "fo fun to watch in the playoffs. Bowman should let JAgr have a lot of\n",
      "fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\n",
      "regular season game.          PENS RULE!!!\n",
      "\n",
      "\n"
     ]
    }
   ],
   "execution_count": 8
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-01-22T03:53:11.794199Z",
     "start_time": "2025-01-22T03:53:11.791087Z"
    }
   },
   "cell_type": "code",
   "source": [
    "print('特征类型')\n",
    "print(type(news.data))\n",
    "print('-' * 50)\n",
    "print(news.target[0:15])\n",
    "from pprint import pprint\n",
    "pprint(list(news.target_names))"
   ],
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "特征类型\n",
      "<class 'list'>\n",
      "--------------------------------------------------\n",
      "[10  3 17  3  4 12  4 10 10 19 19 11 19 13  0]\n",
      "['alt.atheism',\n",
      " 'comp.graphics',\n",
      " 'comp.os.ms-windows.misc',\n",
      " 'comp.sys.ibm.pc.hardware',\n",
      " 'comp.sys.mac.hardware',\n",
      " 'comp.windows.x',\n",
      " 'misc.forsale',\n",
      " 'rec.autos',\n",
      " 'rec.motorcycles',\n",
      " 'rec.sport.baseball',\n",
      " 'rec.sport.hockey',\n",
      " 'sci.crypt',\n",
      " 'sci.electronics',\n",
      " 'sci.med',\n",
      " 'sci.space',\n",
      " 'soc.religion.christian',\n",
      " 'talk.politics.guns',\n",
      " 'talk.politics.mideast',\n",
      " 'talk.politics.misc',\n",
      " 'talk.religion.misc']\n"
     ]
    }
   ],
   "execution_count": 9
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-01-22T03:53:11.800165Z",
     "start_time": "2025-01-22T03:53:11.794199Z"
    }
   },
   "cell_type": "code",
   "source": "len(news.target_names)",
   "outputs": [
    {
     "data": {
      "text/plain": [
       "20"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 10
  },
  {
   "cell_type": "code",
   "source": [
    "print('-' * 50)\n",
    "print(len(news.data))\n",
    "print('新闻所有的标签')\n",
    "print(news.target)\n",
    "print('-' * 50)\n",
    "print(min(news.target), max(news.target))"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2025-01-22T03:53:11.806976Z",
     "start_time": "2025-01-22T03:53:11.800165Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--------------------------------------------------\n",
      "18846\n",
      "新闻所有的标签\n",
      "[10  3 17 ...  3  1  7]\n",
      "--------------------------------------------------\n",
      "0 19\n"
     ]
    }
   ],
   "execution_count": 11
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
